Abstract: In this paper, we propose a novel deep neural
network framework embedded with low-level features (LCNN)
for salient object detection in complex images. We utilise the
advantage of convolutional neural networks to automatically
learn the high-level features that capture the structured
information and semantic context in the image. In order to better
adapt a CNN model to the saliency task, we redesign the
network architecture for small-scale datasets. Several
low-level features, which can effectively capture contrast and
spatial information in the salient regions, are extracted and
incorporated to complement the learned high-level features
at the output of the last fully connected layer. The concatenated
feature vector is further fed into a hinge-loss SVM detector in
a joint discriminative learning manner, and the final saliency
score of each region within the bounding box is obtained by
the linear combination of the detector's weights. Experiments
on three challenging benchmarks (MSRA-5000, PASCAL-S,
ECCSD) demonstrate that our algorithm is effective and superior
to most low-level oriented state-of-the-art methods in terms of
P-R curves, F-measure and mean absolute error.

Index Terms: Convolutional Neural Networks, Feature Learning,
Saliency Detection.

I. Introduction

Humans have the capability to quickly prioritize external
visual stimuli and localize interesting regions in a scene. In
recent years, visual attention has become an important research
problem in both neuroscience and computer vision. The former
focuses on eye fixation prediction to investigate the mechanism
of human visual systems [1], whereas the latter concentrates on
salient object detection to accurately identify a region of
interest [2]. Saliency detection has served as a pre-processing
procedure for many vision tasks, such as collages [3], image
compression [4], stylized rendering [5], object recognition [6]
and image retargeting [7].

In this work, we focus on accurate saliency detection.
Recently, many low-level features directly extracted from images
have been explored. It has been verified that colour contrast is
a primary cue for obtaining satisfactory results [8], [5]. Other
representations based on low-level features try to exploit
the intrinsic textural difference between the foreground and
background, including focusness [9], textural distinctiveness
[10], and structure descriptor [11]. They perform well on
simple benchmarks, but still struggle on images of complex
scenarios, since the semantic context hidden in the image cannot
be effectively captured by hand-crafted low-level priors (see
Figure 1(b)).

Fig. 1. Saliency detection results by different methods: (a) input images;
(b) low-level contrast features by [5]; (c) low-level priors with high-level
objectness cues by [12]; (d) our LCNN algorithm, which combines high-level
features learned by CNN with embedded low-level priors; (e) ground truth.

Due to the shortcomings of low-level features, several
methods have been proposed recently to incorporate high-level
features [13], [9]. One type of such representation that
can be employed is the notion of objectness [14], i.e., how
likely a given region is to be an object. For instance, Jiang et
al. [9] compute the saliency map by combining objectness
values of the candidate windows. However, using the existing
foreground detectors [15], [16] directly to compute saliency
may produce unsatisfying results in complex scenes when the
objectness score fails to predict true salient object regions (see
Figure 1(c)).

The classic convolutional neural network paradigm [17],
[18] has demonstrated superior performance in image
classification and detection on challenging databases with
complex backgrounds and layouts (for instance,
PASCAL and ImageNet), which arises from its ability to
automatically learn high-level features via layer-to-layer
propagation. This is fundamentally different from previous
'objectness' work combining low-level priors. Due to the
different application background and the scale of datasets,
however, a successful adaptation of a deep model to saliency
detection requires a smaller architecture design, a proper
definition of the training examples, and some refinement scheme
such as a low-level feature embedded network.

In this paper, we formulate a novel deep neural network with
low-level features embedded, namely LCNN, which simultaneously
leverages the advantage of CNN in capturing high-level
features and that of the contrast and spatial information
in low-level features. To further strengthen the discriminative
characteristics of the network, we combine those extracted
features in a joint learning manner via the hinge-loss SVM
detector. Figure 1(d) shows the advantage of such a
deep architecture design in cases where the traditional low-level
oriented method [5] or the high-level objectness-guided
algorithm [12] fails to detect the salient regions in complex
image scenarios (for example, when the salient region has a
similar colour or texture appearance to the background, or is
surrounded by a complicated background).

Fig. 2. Pipeline of the low-level feature embedded deep architecture (LCNN).

Figure 2 depicts the general pipeline of our method. First,
a set of candidate bounding boxes with internal region masks
is generated by the selective search method [19]. Next, the
warped patches are fed into the deep network to extract
high-level features; we amend the classic CNN architecture to
adapt it to the saliency detection problem. Third, a series of
simple and effective low-level descriptors are extracted from
the regions within each bounding box. Finally, the concatenated
feature vector is fed as input to the discriminative SVM
detector and the saliency map is generated from the summation
of the detector's confidence scores. The experimental results
show that the proposed method achieves superior performance in
various evaluation metrics against the state-of-the-art
approaches on three challenging benchmarks.

The rest of the paper is organised as follows: section II
reviews related work, section III describes our CNN framework
in detail, section IV presents the low-level feature embedded
scheme, section V verifies the proposed model and section VI
concludes the work. The results and code will be shared online
upon acceptance.

II. Related Works 

In this section, we discuss the related saliency detection
methods and their connection to generic object detection
algorithms. In addition, we briefly review deep neural
networks that are closely related to this work.

Saliency estimation methods can be explored from different
perspectives. Basically, most works employ a bottom-up
approach via low-level features while a few incorporate a
top-down solution driven by specific tasks. In the seminal
work by Itti et al. [20], center-surround differences across
multiple scales of image features are computed to detect local
conspicuity. Ma and Zhang [21] utilize color contrast in a local
neighborhood as a measure of saliency. In [22], the saliency
values are measured by the equilibrium distribution of Markov
chains over different feature maps. Achanta et al. [2] estimate
visual saliency by computing the colour difference between
each pixel and the image mean. Histogram-based global contrast
and spatial coherence are used in [5] to detect saliency. Liu et
al. [23] propose a set of features from both local and global
views, which are integrated by a conditional random field to
generate a saliency map. In [8], two contrast measures based on
the uniqueness and spatial distribution of regions are defined
for saliency detection. To identify small high-contrast regions,
[24] propose a multi-layer approach to analyse the saliency
cues. A regression model is proposed in [25] to directly map
regional feature vectors to saliency scores. Recently, [26]
present a background measurement scheme that utilises a boundary
prior for saliency detection. Liu et al. [27] solve saliency
detection in a novel partial differential equation manner, where
the saliency of certain seeds is propagated until equilibrium
in the image is reached. In [28], colour contrast in a higher
dimensional space is investigated to diversify the distinctness
among superpixels.

Although significant advances have been made, most of
the aforementioned methods integrate hand-crafted features
heuristically to generate the final saliency map, and do not
perform well on challenging benchmarks. In contrast, we
devise a deep network based method embedded with simple
low-level priors (LCNN) to automatically learn features that
disclose the internal properties of regions and the semantic
context in complex scenarios.

Generic object detection methods aim at generating the
locations of all category-independent objects in an image
and have attracted growing interest in recent years. Existing
techniques propose object candidates by either measuring the
objectness of an image window [14], [15] or grouping regions
in a bottom-up process [16]. The generated object candidates
can significantly reduce the search space of category-specific
object detectors, which in turn helps other stages
for recognition and other tasks. In this sense, generic object
detection is closely related to salient object detection. In
[14], the saliency score is utilized as an objectness measurement
to generate object candidates. [12] use a graphical model to
exploit the relationship of objectness and saliency cues for
salient object detection. In [29], a random forest model is
trained to predict the saliency score of an object candidate.
In this work, we utilise the selective search method [19] to
generate a series of potential foreground bounding boxes as a
preliminary preparation for the inputs of the deep network.

TABLE I
Architecture details of the proposed deep network. C: convolutional layer; F: fully-connected layer; P: pooling layer; R:
rectified linear unit (ReLU); N: local response normalization (LRN); D: dropout scheme; Channel: the number of output
feature maps; Padding: the number of pixels to add to each side of the input during convolution.

Layer           1          2          3          4          5          6          7
Type            C+R+P+N    C+R+P+N    C+R        C+R+P      C+R+P      F+R+D      F+R+D
Input size      227 x 227  27 x 27    13 x 13    13 x 13    6 x 6      2 x 2      512 + 104
Channel         96         256        384        384        256        -          -
Filter size     11 x 11    5 x 5      3 x 3      3 x 3      3 x 3      -          -
Filter stride   4          -          -          -          -          -          -
Padding         -          2          1          1          1          -          -
Pooling size    3 x 3      3 x 3      -          3 x 3      3 x 3      -          -
Pooling stride  2          2          -          2          3          -          -

Deep neural networks have achieved state-of-the-art results
in image classification [30], [31], object detection [32], [33]
and scene parsing [34], [35]. The success stems from the
expressiveness and capacity of deep architectures, which
facilitate learning complex features and models that account for
interacting relationships directly from training examples. Since
DNNs mainly take image patches as inputs, they tend to fail in
capturing long-range label dependencies for scene parsing as
well as saliency detection. To address this issue, [35] use
a recurrent convolutional neural network to consider large
contexts. In [34], a DNN is applied in a multi-scale manner to
learn hierarchical feature representations for scene labeling.
We propose a revised CNN pipeline with low-level features
embedded to consider the label (region) dependencies based
on contrast and spatial descriptors, which is of vital importance
in the saliency detection task.

III. CNN Based Saliency Detection

The motivation for applying CNN to saliency detection
is that the network can automatically learn structured and
representative features via a layer-to-layer hierarchical
propagation scheme, so that we do not have to design complicated
hand-crafted features. The key points in making CNN work
for saliency are: (a) a redesigned network architecture, since,
unlike [18] on ImageNet [36], too many layers
or parameters would burden the computation on a relatively
small-scale saliency dataset; (b) a proper definition of positive
training examples, that is, considering the varying sizes of
(possibly multiple) salient objects, how to define a positive
region within the box compared with the ground truth; (c)
a 'refinement' scheme at the output of the last
layer to better fit accurate saliency detection. Sections
III-A to III-C disclose our solutions to these issues,
respectively.

A. Network architecture 

The proposed CNN consists of seven layers, with five
convolutional layers and two fully connected layers. Each
layer contains learnable parameters and consists of a linear
transformation followed by a nonlinear mapping, which is
implemented by rectified linear units (ReLUs) [17] to
accelerate the training process. Local response normalization
(LRN) is applied to the first two layers to help generalization.
Max pooling is applied to all convolutional layers except for
the third layer to ensure translational invariance. The dropout
scheme is utilized after the first and the second fully connected
layers to avoid overfitting. The network takes as input a warped
RGB image patch of size 227 x 227, and outputs a
512-dimensional feature vector for the SVM detector¹. The detailed
architecture of the network is shown in Table I.
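As a quick consistency check on the input sizes listed in Table I, the spatial size after each convolutional layer can be traced with the usual convolution/pooling output formula. The sketch below is illustrative only: it assumes a stride of 1 and a padding of 0 wherever Table I shows '-', and the names `out_size`, `LAYERS` and `trace_sizes` are hypothetical helpers rather than part of the authors' implementation.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


# One tuple per convolutional layer of Table I:
# (filter size, filter stride, padding, pooling size, pooling stride).
# None means the layer has no pooling. Stride 1 / padding 0 are ASSUMED
# where Table I shows '-'.
LAYERS = [
    (11, 4, 0, 3, 2),        # layer 1
    (5, 1, 2, 3, 2),         # layer 2
    (3, 1, 1, None, None),   # layer 3 (no pooling)
    (3, 1, 1, 3, 2),         # layer 4
    (3, 1, 1, 3, 3),         # layer 5
]


def trace_sizes(input_size=227):
    """Return the input size of layers 1..6 along one spatial dimension."""
    sizes = [input_size]
    size = input_size
    for kernel, stride, padding, pool, pool_stride in LAYERS:
        size = out_size(size, kernel, stride, padding)
        if pool is not None:
            size = out_size(size, pool, pool_stride)
        sizes.append(size)
    return sizes


print(trace_sizes())
```

Under these assumptions the traced sizes agree with the 'Input size' row of Table I for layers 1 to 6, so a 2 x 2 map with 256 channels (1024 values) reaches the first fully connected layer.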

To generate the square patches for both training and testing,
we first use the selective search method [19] to propose around
2,000 boxes, each of which also includes the region mask
segmented in different color spaces by [37]. Note that we
apply a preliminary selection scheme to filter out small boxes
and those whose region mask accounts for little area w.r.t. the
whole box. Then we warp all pixels in the tight bounding box
around the region to the required size. Prior to warping, we pad
the box to include more local context, as in [18].

B. Network training 

Training data. To label the training boxes, we mainly
consider the intersection between the bounding box and the
ground truth mask. A box B is considered a positive sample
if it sufficiently overlaps with the ground truth region G:
$|B \cap G| > 0.7 \times \max(|B|, |G|)$; similarly, a box is labeled
as a negative sample if $|B \cap G| < 0.3 \times \max(|B|, |G|)$. The
remaining samples, labeled as neither positive nor negative, are
not used. Following [17], we do not pre-process the training
samples, except for subtracting the mean values over the
training set from each pixel. The labelling criteria and the
process of patch generation are illustrated in Figure 3(a)-(b).
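The labelling rule above can be sketched in a few lines. This is a minimal illustration, assuming that |B| is the pixel area of the rectangular box and that boxes are given as (top, left, bottom, right) with exclusive bottom/right bounds; `box_to_mask` and `label_box` are hypothetical names, not functions from the authors' code.

```python
import numpy as np


def box_to_mask(box, shape):
    """Boolean mask of a (top, left, bottom, right) box; bounds are exclusive."""
    top, left, bottom, right = box
    mask = np.zeros(shape, dtype=bool)
    mask[top:bottom, left:right] = True
    return mask


def label_box(box, gt_mask, pos_thresh=0.7, neg_thresh=0.3):
    """Return 1 (positive), 0 (negative) or None (sample not used)."""
    box_mask = box_to_mask(box, gt_mask.shape)
    inter = np.logical_and(box_mask, gt_mask).sum()
    denom = max(box_mask.sum(), gt_mask.sum())
    if denom == 0:
        return None  # empty box and empty ground truth: nothing to label
    if inter > pos_thresh * denom:
        return 1
    if inter < neg_thresh * denom:
        return 0
    return None


gt = np.zeros((10, 10), dtype=bool)
gt[2:8, 2:8] = True
print(label_box((2, 2, 8, 8), gt), label_box((8, 8, 10, 10), gt),
      label_box((2, 5, 8, 10), gt))
```

Dividing by the larger of the two areas penalises both boxes that are much bigger than the object and boxes that cover only a small part of it, which is why a single threshold pair suffices.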

Fig. 3. (a) Illustration of labelling; (b) generation of patches. Note that the orange region inside each padded sample is the 'cell unit' in our task, which
means we use it to extract low-level features and compute saliency; (c) visualization of the 96 learned filters in the first layer.

¹ In the original CNN framework, layer 7 outputs the same feature length
(1024 dimensions) as layer 6 does. In order to better balance high-level
and low-level features, we reduce the output of layer 7 to
512 dimensions. Note that in later experiments without the low-level feature
embedded architecture, layer 7 still outputs a 1024-dimensional feature vector.

Cost function. Given the training box set $\{B_i\}_{i=1}^{N}$ and the
corresponding label set $\{y_i\}_{i=1}^{N}$, we use the softmax loss with
weight decay as the cost function:

$$L(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=0}^{1}\delta(y_i, j)\,\log P(y_i = j \mid \theta) + \lambda\sum_{k=1}^{7}\|W_k\|_F^2 \qquad (1)$$

where $\theta$ denotes the learnable parameter set of the CNN,
including the weights and biases of all layers; $\delta$ is the
indicator function; $P(y_i = j \mid \theta)$ is the label probability of
the i-th training example predicted by the CNN; $\lambda$ is the weight
decay parameter; and $W_k$ indicates the weight of the k-th layer. The
CNN is trained using stochastic gradient descent with a batch size
of m = 256, momentum of 0.9, and weight decay of 0.0005.
The learning rate is initially set to 0.01 and is decreased
by a factor of 0.1 when the cost is stabilized. Figure 3(c)
illustrates the learned convolutional filters in the first layer,
which capture color, contrast, edge and pattern information of
the local neighborhoods.
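To make the cost function concrete, the following NumPy sketch evaluates a softmax loss with weight decay for one mini-batch. It is not the authors' Caffe implementation: averaging the data term over the batch and using the squared Frobenius norm for the decay term are assumptions of this sketch, and `softmax_weight_decay_loss` is a hypothetical name.

```python
import numpy as np


def softmax_weight_decay_loss(logits, labels, weights, weight_decay=0.0005):
    """Softmax cross-entropy over a batch plus a weight decay penalty.

    logits:  (m, 2) array of class scores for m boxes (background, salient).
    labels:  (m,) integer array with values in {0, 1}.
    weights: list of per-layer weight arrays.
    """
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    m = logits.shape[0]
    # Picking the log-probability of the true label plays the role of the
    # indicator function in the double sum.
    data_term = -log_prob[np.arange(m), labels].mean()
    # ASSUMPTION: squared Frobenius norm of each layer's weights.
    reg_term = weight_decay * sum(np.sum(w ** 2) for w in weights)
    return data_term + reg_term


logits = np.zeros((2, 2))
labels = np.array([0, 1])
print(softmax_weight_decay_loss(logits, labels, [np.ones((2, 2))]))
```

With all-zero logits both classes are equally likely, so the data term equals log 2 and the remainder of the printed value comes from the decay penalty.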

C. CNN for Saliency detection 

During the test stage, we feed the trained network with
padded and warped patches and predict the saliency score
of each bounding box using the probability $P(y = 1 \mid \theta)$. A
primitive saliency map is obtained by summing up the saliency
scores of all the candidate regions within the proposed
bounding boxes. Figure 4(b) shows the result of directly applying
CNN's last layer as the saliency detector to generate saliency
maps, which is denoted as the baseline model. However, as
shown in later experiments (section V-B) and in [18], such
a straightforward strategy may suffer from the definition of
positive examples used in training the network, which does not
emphasise precise localisation of the salient object within the
bounding boxes.
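The summation step that produces the primitive saliency map can be sketched as follows. The region masks and scores are taken as given, and rescaling the accumulated map to [0, 1] is an assumption of this sketch rather than something stated in the text; `accumulate_saliency` is a hypothetical name.

```python
import numpy as np


def accumulate_saliency(shape, region_masks, scores):
    """Sum each candidate region's score over the pixels it covers."""
    saliency = np.zeros(shape, dtype=np.float64)
    for mask, score in zip(region_masks, scores):
        saliency[mask] += score
    # ASSUMPTION: rescale to [0, 1] so maps of different images are comparable.
    peak = saliency.max()
    return saliency / peak if peak > 0 else saliency


m1 = np.zeros((4, 4), dtype=bool)
m1[0:2] = True
m2 = np.zeros((4, 4), dtype=bool)
m2[1:3] = True
print(accumulate_saliency((4, 4), [m1, m2], [0.5, 1.0])[:, 0])
```

Pixels covered by several high-scoring regions accumulate the largest values, which is what lets overlapping proposals vote for the salient object.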

To this end, we introduce a discriminative learning method
using the hinge-loss SVM to further classify the extracted
high-level features (i.e., the output of layer 7). The objective
function is formulated as:

$$\arg\min_{w}\ \frac{1}{2}\|w\|^2 + C\sum_{i}\max\left(0,\ 1 - y_i w^{T} x_i\right) \qquad (2)$$

where w is the weight vector of the SVM detector and C the penalty
coefficient. Here we set C = 0.001 to ensure computational
efficiency. The revised saliency score of each bounding box or
internal region is calculated as $w^{T} x_i + b$, where w and b
represent the weights and bias of the detector and $x_i$ is the output
feature vector of the fc7 layer. Figure 4(c) depicts the visual
enhancement of the saliency maps after enforcing an SVM
mechanism, which can discriminatively choose representative
high-level features to determine the saliency of the region.
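For illustration, the hinge-loss objective and the resulting linear saliency score can be written directly in NumPy. This is a sketch under stated assumptions, not the solver used in the paper: labels are taken to be in {-1, +1}, no optimisation is performed, and `hinge_objective` and `saliency_score` are hypothetical names.

```python
import numpy as np


def hinge_objective(w, X, y, C=0.001):
    """0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i * w^T x_i).

    X: (n, d) feature matrix, y: (n,) labels in {-1, +1} (ASSUMED encoding).
    """
    margins = np.maximum(0.0, 1.0 - y * (X @ w))
    return 0.5 * float(np.dot(w, w)) + C * float(margins.sum())


def saliency_score(w, b, x):
    """Linear detector response w^T x + b for one feature vector."""
    return float(np.dot(w, x) + b)


w = np.array([1.0, 0.0])
X = np.array([[2.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, -1.0])
print(hinge_objective(w, X, y), saliency_score(w, 0.1, X[0]))
```

In this toy case the first sample lies outside the margin and contributes nothing, while the second is on the wrong side and contributes a hinge loss of 1.5; only such margin violators influence the learned weights.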

So far, the CNN framework with an SVM detector predicts
saliency values based solely on the automatically learned
high-level features, which can include high-level semantic context
in the image via the box padding and a layer-to-layer
propagation scheme. We find that by adding some simple low-level
priors, such as contrast or geometric information, the CNN
framework obtains considerably better results.

IV. LCNN: Low-level Feature Embedded CNN 

High-level features from CNN alone are not enough, for the
following reason. The CNN-based prediction determines saliency
solely based on how much a particular sub-region looks like an
object bounding box, whereas the low-level saliency methods
typically rely on contrast or spatial cues from the global
context, which is valuable information missing in the somewhat
'local' CNN prediction. In this section, we propose a small, and
yet effective, set of simple low-level features to complement
those high-level features in a joint learning spirit. Different
from [25], where many of the numerous low-level features are
proved to be redundant [38], we use the most common priors, such
as colour contrast and spatial properties. To enlarge the feature
space diversity, we also explore the texture information in the
image by extracting LBP features [39] and LM filter bank
responses [40].

A. Exploring low-level features 

The proposed 104-dimensional low-level features cover a
wide diversity, from the colour and texture contrast of a region
to the spatial properties of a bounding box. First, given a
region R within the bounding box generated by the selective
search method and using the RGB colour space as an example,
we compute its RGB histogram $h_R^{RGB}$, average RGB values
$a_R^{RGB}$ and RGB colour variance $var_R^{RGB}$ over all the pixels
in the candidate region. Then, in order to characterize the
texture of the region, we calculate the max response
histogram of LM filters $h_R^{LM}$, the histogram of the LBP feature
$h_R^{LBP}$, the absolute response of LM filters $r_R$, as well as the
variance of the LBP feature $var_R^{LBP}$ and of the LM filters
$var_R^{LM}$. Furthermore, we define the border regions of 20 pixels
width in four directions of the image as boundary regions² and
compute the measurements $h_B^{CS}$, $a_B^{CS}$, $h_B^{TX}$ and $r_B$
in a similar way as defined above. We also consider the colour
histogram $h_I^{CS}$ of the entire image in three colour spaces. Here
CS denotes the three colour spaces and TX represents the two
texture features extracted by LBP and LM.

Fig. 4. Resultant saliency maps of different architecture designs: (a) input image; (b) baseline model; (c) CNN with SVM detector; (d) CNN with spatial
descriptors alone; (e) CNN with contrast descriptors alone; (f) CNN with low-level features (contrast and spatial descriptors together); (g) the proposed LCNN.

Equipped with the aforementioned definitions and notations,
we define a set of low-level features. For the contrast
descriptors, we introduce the boundary colour contrast by
the chi-square distance $\chi^2(h_R^{RGB}, h_B^{RGB})$ between the RGB
histograms of the candidate region and the four boundary
regions, and the Euclidean distance $d(a_R^{RGB}, a_B^{RGB})$ between
their mean RGB values. The rest of the colour or texture
contrasts between the region and the boundary regions, or
the entire image, are computed similarly. For the spatial
descriptors, we not only consider the geometric information of
a bounding box, such as the aspect ratio, height/width and
centroid coordinates, but also extract the internal colour and
texture variance of the candidate region. Note that all the
geometric features are normalised w.r.t. the image size. Finally,
all the low-level features are summarised in Table II.
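As an illustration of one contrast descriptor, the sketch below computes the chi-square distance between the colour histogram of a candidate region and that of each of the four 20-pixel boundary strips. Several details are assumptions made for the example: the histogram is a concatenation of 8-bin per-channel histograms, the chi-square form omits any constant factor, and all function names are hypothetical.

```python
import numpy as np


def boundary_masks(height, width, border=20):
    """Boolean masks of the top, bottom, left and right boundary strips."""
    masks = []
    for rows, cols in [(slice(0, border), slice(None)),
                       (slice(height - border, height), slice(None)),
                       (slice(None), slice(0, border)),
                       (slice(None), slice(width - border, width))]:
        mask = np.zeros((height, width), dtype=bool)
        mask[rows, cols] = True
        masks.append(mask)
    return masks


def channel_histogram(image, mask, bins=8):
    """Concatenated per-channel histogram of the masked pixels, normalised."""
    pixels = image[mask]  # shape (N, 3)
    hist = np.concatenate([
        np.histogram(pixels[:, c], bins=bins, range=(0, 256))[0]
        for c in range(3)
    ]).astype(np.float64)
    total = hist.sum()
    return hist / total if total > 0 else hist


def chi_square(h1, h2, eps=1e-12):
    """Chi-square distance; ASSUMED form sum((h1 - h2)^2 / (h1 + h2))."""
    return float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))


def boundary_colour_contrast(image, region_mask, border=20):
    """Four chi-square distances, one per boundary direction."""
    h_region = channel_histogram(image, region_mask)
    height, width = region_mask.shape
    return np.array([chi_square(h_region, channel_histogram(image, m))
                     for m in boundary_masks(height, width, border)])


image = np.zeros((100, 100, 3), dtype=np.uint8)
image[40:60, 40:60] = 255
region = np.zeros((100, 100), dtype=bool)
region[40:60, 40:60] = True
print(boundary_colour_contrast(image, region))
```

Keeping one value per direction, instead of pooling the four strips, follows the text's remark that boundary regions in different directions may look different.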

B. LCNN for saliency detection 

We concatenate the low-level feature vector proposed above
with the high-level feature vector generated from layer 7 and
use them as input of the SVM detector (see Figure 2). The
revised architecture, namely the low-level feature embedded
CNN (LCNN), achieves better performance than previous
schemes. Note that prior to feeding the concatenated feature
into the SVM detector, we pre-process the data by subtracting
the mean and dividing by the standard deviation of the feature
elements. The final saliency map follows a similar pipeline as
stated in section III-C and we refine the map on a pixel-wise
level using the manifold ranking smoothing [7].

² Since the boundary regions in different directions may have different
appearance, we compute their measurements separately. For notational
convenience, we denote the feature vectors of the boundary regions in each
direction with a uniform subscript B.

Figure 4(d)-(f) illustrates the different effects of low-level
features. We can see that the contrast descriptors (Figure 4(e))
play a more important role than the spatial descriptors
(Figure 4(d)), as the former consider the appearance distinction
between the region and its surroundings. Embedding both kinds of
low-level features into the CNN framework (Figure 4(f)) improves
the accuracy of saliency detection, since the low-level priors can
capture the distinctness between the salient regions and the
image boundary (which indicates the background in most
cases). Furthermore, as Figure 4(g) suggests, our final scheme
(LCNN), which includes the SVM detector based on the low-level
feature embedded deep network, can take advantage
of both low-level priors and a discriminative learning detector.
Note that the bicycle and the person's legs are effectively
detected in such a framework, whereas previous schemes fail
to detect them in some way. Figure 5 in section V-B validates
our architecture design in a quantitative manner.

V. Experimental Results 

In this section, we first describe in detail the experiment
settings on datasets, evaluation metrics and training
environment (V-A); then ablation studies are conducted to
verify each architecture strategy (V-B); finally we compare
the proposed algorithm with the current state-of-the-art methods
both quantitatively and qualitatively (V-C).

A. Setup 

The experiments are conducted on three benchmarks:
MSRA-5000 [23], ECCSD [24] and PASCAL-S [29]. The
MSRA-5000 dataset is widely used for saliency detection
and covers a large variety of image contents. Most of the
images include only one salient object with high contrast to the
background. The ECCSD dataset consists of 1000 images with
complex scenes from the Internet and is more challenging.
The newly released PASCAL-S dataset descends from the
validation set of the PASCAL VOC 2012 segmentation
challenge. This dataset includes 850 natural images with multiple
complex objects and cluttered backgrounds. The PASCAL-S
dataset is arguably one of the most challenging saliency
datasets without various design biases (e.g., center bias and
color contrast bias). All the datasets are bundled with
pixel-wise ground truth annotations.

TABLE II
The detailed description of low-level features. r denotes the absolute response of LM filters;
$d(A_1, A_2) = (|a_{11} - a_{21}|, \cdots, |a_{1k} - a_{2k}|)$, where k is the feature dimension of vectors $A_1$ and $A_2$;
$\chi^2(H_1, H_2)$ is the chi-square distance between histograms $H_1$ and $H_2$, with b being the number of histogram bins.

Contrast descriptors (colour and texture)
Notation     Definition
c1 - c4      …
c5 - c8      …
c9 - c12     $\chi^2(h_R^{HSV}, h_B^{HSV})$
c13          …
c14          chi-square distance between two histograms
c15          $\chi^2(h_R^{HSV}, h_I^{HSV})$
c16 - c27    …
c28 - c39    …
c40 - c51    $d(a_R^{HSV}, a_B^{HSV})$
c52 - c55    …
c56 - c59    chi-square distance between two histograms
c60 - c74    $d(r_R, r_B)$

Spatial/property descriptors
Notation     Definition
p1 - p2      centroid coordinates
p3           box aspect ratio
p4           box width
p5           box height
p6           …
p7 - p21     a variance term (var)
p22 - p24    …
p25 - p27    a variance term (var)
p27 - p30    $var_R^{HSV}$

Fig. 5. Ablation study on the MSRA-5000 test dataset and quantitative comparison to previous methods on three benchmarks:
(a) ablation study on MSRA-5000; (b) ECCSD; (c) PASCAL-S; (d) MSRA-5000.

We evaluate the performance using precision-recall (PR)
curves, F-measure and mean absolute error (MAE). The
precision and recall of a saliency map are computed by segmenting
the map with a threshold, and comparing the resultant binary
map with the ground truth. The PR curves demonstrate the
mean precision and recall of different saliency maps at various
thresholds. The F-measure is defined as:

$$F_\beta = \frac{(1 + \beta^2)\,Precision \times Recall}{\beta^2\,Precision + Recall} \qquad (3)$$

where Precision and Recall are computed using twice the
mean saliency value of saliency maps as the threshold, and
$\beta^2$ is set to 0.3. The MAE is the average per-pixel difference
between saliency maps S and the ground truth GT:

$$MAE = \frac{1}{W \times H}\sum_{x=1}^{W}\sum_{y=1}^{H}\left|S(x, y) - GT(x, y)\right| \qquad (4)$$

where W and H denote the width and height of the saliency
map, respectively. The metric takes the true negative saliency
assignments into account, whereas the precision and recall only
favour the successfully assigned saliency to the salient pixels
[41].
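Both metrics are simple to compute for a single saliency map. The sketch below assumes the map is scaled to [0, 1], that pixels at or above the adaptive threshold count as salient, and that empty denominators yield a score of 0; these conventions and the function names are assumptions of the example, not details given in the text.

```python
import numpy as np


def f_measure(saliency, gt, beta_sq=0.3):
    """F-measure with the adaptive threshold of twice the mean saliency."""
    threshold = 2.0 * saliency.mean()
    binary = saliency >= threshold  # ASSUMPTION: inclusive comparison
    true_pos = np.logical_and(binary, gt).sum()
    precision = true_pos / binary.sum() if binary.sum() > 0 else 0.0
    recall = true_pos / gt.sum() if gt.sum() > 0 else 0.0
    denom = beta_sq * precision + recall
    if denom == 0:
        return 0.0
    return float((1.0 + beta_sq) * precision * recall / denom)


def mean_absolute_error(saliency, gt):
    """Average per-pixel absolute difference between map and ground truth."""
    return float(np.mean(np.abs(saliency - gt.astype(np.float64))))


gt = np.zeros((4, 4), dtype=bool)
gt[:2, :2] = True
perfect = gt.astype(np.float64)
print(f_measure(perfect, gt), mean_absolute_error(perfect, gt))
```

Setting the weight to 0.3 emphasises precision over recall, while MAE also rewards correctly suppressed background pixels, so the two measures complement each other.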

Since the MSRA-5000 dataset covers various scenarios
and the PASCAL-S dataset contains images with complex
structures, we randomly choose 2500 images from the MSRA-5000
dataset and 400 images from the PASCAL-S dataset to
train the network. The remaining images are used for testing.
Both horizontal reflection and rescaling (±5%) are applied to
all the training images to augment the training dataset. The
training process is implemented using the Caffe framework
[42] and initialised with the default parameter setting suggested
in [17]. We train the network for roughly 80 epochs through
the training set of 1.3 million samples, which takes three weeks
on an NVIDIA GTX 760 4GB GPU.

B. Ablation studies 

Figure 5(a) compares the performance of different
architecture designs on the MSRA-5000 test dataset in a
quantitative manner. Note that without a preliminary selective
search scheme (line 1), the network suffers from a severe
shortage of positive samples during training and lacks a proper
foreground 'guidance' to predict saliency during the test stage.

³ Note that we round the values to 2 decimal digits.
TABLE III
Quantitative results using F-measure (higher is better) and MAE (lower is better). The best three results are highlighted in
red, blue and green, respectively.

Dataset    Metric     GC     HS     MR     PD     SVO    UFO    HPS    RB     HCT    BMS    DSR    LCNN
ECCSD      F-measure  0.56³  0.63   0.70   0.58   0.24   0.64   0.60   0.67   0.64   0.62   0.61   0.71
ECCSD      MAE        0.22   0.23   0.19   0.25   0.41   0.21   0.25   0.18   0.20   0.22   0.24   0.16
PASCAL-S   F-measure  0.49   0.54   0.60   0.53   0.27   0.55   0.52   0.61   0.54   0.58   0.57   0.65
PASCAL-S   MAE        0.25   0.25   0.21   0.24   0.37   0.23   0.26   0.19   0.23   0.21   0.24   0.16
MSRA-5000  F-measure  0.70   0.77   0.79   0.71   0.30   0.77   0.71   0.78   0.77   0.75   0.76   0.79
MSRA-5000  MAE        0.15   0.16   0.13   0.20   0.36   0.15   0.21   0.11   0.14   0.16   0.14   0.12

Also, the rough score summation of bounding boxes can only
generate fuzzy and blurry saliency maps, which are incapable of
supporting a precise salient object detection task. The baseline
model (line 2) takes the primitive architecture of Table I without
the final regression scheme and the introduction of low-level
features. We can see that the performance improves slightly after
the incorporation of the SVM detector (line 3), particularly
in the range of low recall values. Lines 4-6 investigate the
different effects of low-level features. We find that the contrast
descriptors (line 5) play a more important role in improving
the saliency accuracy than do the spatial descriptors (line 4),
and a combination of both contrast and spatial features (line 6)
can effectively enhance the result. Finally, the SVM detector
can discriminatively classify the extracted features into the
foreground and the background, thus formulating our final
version of the low-level feature embedded CNN architecture
(line 7).

C. Performance comparison 

We compare the proposed method (LCNN) with the
traditional low-level oriented algorithms as well as the newly
published state-of-the-art methods: IT [20], GB [22], FT [2], CA [3],
RA [43], BS [44], LR [13], SVO [12], CB [45], SF [8], HC
[5], PD [46], MR [47], HS [24], BMS [48], UFO [9], DSR
[49], HPS [7], GC [41], RB [26], HCT [28]. We use either the
implementations or the saliency maps provided by the authors
for fair comparison.

Our method performs favourably against the state-of-the-art
methods on three benchmarks in terms of P-R curves (Figure 5),
F-measure as well as MAE scores (Table III). We achieve
the highest F-measure values of 0.712 and 0.648 and the lowest
MAE values of 0.161 and 0.164 on the ECCSD and PASCAL-S
datasets, respectively. The performance on the MSRA-5000 dataset
is very close to the best method [47]. Figure 6 reports the
visual comparison of different saliency maps. Our algorithm
can effectively capture key colour or structure information in
complex image scenarios by learning both low-level features
and high-level semantic context.

VI. Conclusions 

In this paper, we address the salient object detection
problem by learning the high-level features via deep convolutional
neural networks and incorporating the low-level features into
the deep model to enhance the saliency accuracy. To further
capture the discriminant semantic context in complex image
scenarios, we introduce a hinge-loss SVM detector to better
distinguish the salient region(s) within each bounding box.
Experimental results show that our algorithm achieves superior
performance against the state-of-the-art methods on three
benchmarks. A straightforward extension to our method is to
jointly learn global and local saliency context through a novel
neural network architecture instead of relying on hand-crafted
low-level features, which will be left as our future work.